I. Introduction

Zillow has realized that its housing market predictions are not as accurate as they could be because they do not factor in enough local intelligence. As such they have asked us to build a better predictive model of home prices for Mecklenberg County, NC. This model aims to build an accurate and generalizable hedonic model that predicts home prices for Mecklenberg County, NC by deconstructing overall home price into the value of constituent parts. Accurate models lead to only small differences between the predicted and observed values, and generalizable models accurately predict on new data and with comparable accuracy across various groups. By working to improve the accuracy and generalizability of the predictive model, we are ultimately striving to create a more useful decision-making tool. Such a model may be useful for local governments as they assess property taxes, for example.

II. Data Manipulation and Visualization

2.0 Setup

In this section, we loaded necessary libraries, created plot theme options and map theme options, and identified functions of quintile breaks, and average nearest neighbor distance for further analysis.

# Load some libraries

library(tidyverse)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot) # plot correlation plot
library(corrr)      # another way to plot correlation plot
library(kableExtra)
library(jtools)     # for regression model plots
library(ggstance) # to support jtools plots
library(ggpubr)    # plotting R^2 value on ggplot point scatter
library(broom.mixed) # needed for effects plots
library(tidycensus)
library(geojsonio)
library(stargazer)
library(utils)
library(rgdal)
#library(plyr)

# functions and data directory
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

palette5 <- c("#25CB10", "#5AB60C", "#8FA108",   "#C48C04", "#FA7800")

2.1 Data Wrangling

In this section, we loaded Mecklenburg County data and other complementary datasets and conducted feature selection and engineering. Besides, for other analyses, we separate the whole dataset into and in advance of Mecklenburg.

Five categorical datasets were used in this project: Mecklenburg County House Price and Internal Characteristics: basic geometric dataset provided by the course in advance.

Mecklenburg County Base Data: geometric data containing the geographic information of Mecklenburg County. Mecklenburg County Beach Neighborhood: As an alternative way to using neighborhood data of Mecklenburg County Beach, we used Mecklenburg County Municipal Boundary data that is a polygon feature class of municipal boundaries within Mecklenburg County-Dade County, and specifically filter geometric data in Mecklenburg County Beach.

Census Data: demographic variables from the ACS 2017 for census tracts in Mecklenburg County. We specifically selected 8 variables in the analysis:

• TotalPop: Total population in each census tract
• Whites: White population in each census tract
• MedHHInc: Median household income in each census tract
• MedRent: Median Rent for properties in each census tract
• TotalPoverty: Population living under the level of poverty in each census tract
• TotalUnit: Total housing units in each census tract
• pctWhite: white population proportion in each census tract
• pctBachelors: Bachelor population proportion in each census tract
• pctPoverty: Poverty population proportion in each census tract
• pctTotalVacant: Vacant unites proportion in each census tract

Amenity Data: most of the data come from the website of government: http://maps.co.mecklenburg.nc.us/openmapping/data.html.Compilatory data were chosen for describing amenities and public services of Mecklenburg County. The datasets we chose encompasses fireplace, police station, education opportunity, healthcare service, and entertainment accessibility. See each dataset below:

1. Public School: Dataset contains points representing Elementary, Middle, and High Schools in Mecklenburg County.
2. Bus station: A point feature class of CATS Bus Stops for Mecklenburg County.
3. Fire station : Geographical representation of the physical locations of the Charlotte Fire Department Fire Stations.
4. Church : Churches dataset defines locations classified as religious organizations in Mecklenburg County, data is queried from Master Address Table.
5. Medical Facilities: This dataset was generated to publish a geospatial inventory of Medical Facilities in Mecklenburg County, NC. This data serves to support and assist the Charlotte Mecklenburg Planning Department, The City of Charlotte, and others in Urban Planning decisions.
6. Park : Park amenities for all parks in Mecklenburg County, including Charlotte, Davidson, Cornelius, Davidson, Huntersville, Matthews, Mint Hill, and Pineville.
7. Police: Charlotte-Mecklenburg Police Department Stations.

primary.data<-
  st_read("https://raw.githubusercontent.com/mafichman/MUSA_508_Lab/main/Midterm/data/2022/studentData.geojson")%>%
  st_transform('EPSG:2264')
## Reading layer `studentData' from data source 
##   `https://raw.githubusercontent.com/mafichman/MUSA_508_Lab/main/Midterm/data/2022/studentData.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 46183 features and 71 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -81.05626 ymin: 35.01345 xmax: -80.55941 ymax: 35.51012
## Geodetic CRS:  WGS 84
#school data
cmpd_school<-
  st_read('https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/CMS_Schools.geojson') %>%
  st_transform('EPSG:2264')
## Reading layer `CMS_Schools' from data source 
##   `https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/CMS_Schools.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 198 features and 12 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1394092 ymin: 468361 xmax: 1514645 ymax: 638289
## Projected CRS: NAD83 / North Carolina (ftUS)
# firestation
cmpd_firestation<-
  st_read('https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/CFD_FireStations.geojson') %>%
  st_transform('EPSG:2264')
## Reading layer `CFD_FireStations' from data source 
##   `https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/CFD_FireStations.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 43 features and 18 fields
## Geometry type: POINT
## Dimension:     XYZ
## Bounding box:  xmin: -81.00421 ymin: 35.04484 xmax: -80.66572 ymax: 35.37415
## z_range:       zmin: -3.505126e-05 zmax: -3.448315e-05
## Geodetic CRS:  WGS 84
# church
cmpd_church<-
  st_read('https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/Churches.geojson') %>%
  st_transform('EPSG:2264')
## Reading layer `Churches' from data source 
##   `https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/Churches.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 880 features and 44 fields
## Geometry type: POINT
## Dimension:     XYZ
## Bounding box:  xmin: -81.03116 ymin: 35.018 xmax: -80.5799 ymax: 35.50598
## z_range:       zmin: -3.527943e-05 zmax: -3.443845e-05
## Geodetic CRS:  WGS 84
# medical facilities
cmpd_medical<-
  st_read('https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/MedicalFacilities.geojson') %>%
  st_transform('EPSG:2264')
## Reading layer `MedicalFacilities' from data source 
##   `https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/MedicalFacilities.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 906 features and 9 fields
## Geometry type: POINT
## Dimension:     XYZ
## Bounding box:  xmin: -80.99421 ymin: 35.01552 xmax: -80.63303 ymax: 35.50797
## z_range:       zmin: -3.528316e-05 zmax: -3.443379e-05
## Geodetic CRS:  WGS 84
# park
cmpd_park<-
  st_read('https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/Park_Locations.geojson') %>%
  st_transform('EPSG:2264')
## Reading layer `Park_Locations' from data source 
##   `https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/Park_Locations.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 346 features and 97 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1394440 ymin: 467223.2 xmax: 1523569 ymax: 644298.8
## Projected CRS: NAD83 / North Carolina (ftUS)
# police
cmpd_police<-
  st_read('https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/CMPD_Police_Offices.geojson') %>%
  st_transform('EPSG:2264')
## Reading layer `CMPD_Police_Offices' from data source 
##   `https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/CMPD_Police_Offices.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 15 features and 8 fields
## Geometry type: POINT
## Dimension:     XYZ
## Bounding box:  xmin: -80.94242 ymin: 35.04139 xmax: -80.72359 ymax: 35.34361
## z_range:       zmin: -3.49991e-05 zmax: -3.448036e-05
## Geodetic CRS:  WGS 84
# bus
cmpd_busstation<-
  st_read('https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/CATS_BusStops.geojson') %>%
  st_transform('EPSG:2264')
## Reading layer `CATS_BusStops' from data source 
##   `https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/CATS_BusStops.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 2992 features and 16 fields
## Geometry type: POINT
## Dimension:     XYZ
## Bounding box:  xmin: -81.17988 ymin: 34.92704 xmax: -80.55811 ymax: 35.50915
## z_range:       zmin: -3.528502e-05 zmax: -3.428385e-05
## Geodetic CRS:  WGS 84
#neighborhood
neighborhood <-
  st_read('https://raw.githubusercontent.com/RumRon/HW_UP_IP/main/Midterm/geojson/VFD_FireDistricts.geojson')%>%
  st_transform('EPSG:2264')
## Reading layer `VFD_FireDistricts' from data source 
##   `https://raw.githubusercontent.com/RumRon/HW_UP_IP/main/Midterm/geojson/VFD_FireDistricts.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 143 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XYZ
## Bounding box:  xmin: -81.05821 ymin: 35.00217 xmax: -80.55011 ymax: 35.51917
## z_range:       zmin: 0 zmax: 0
## Geodetic CRS:  WGS 84
police.sf <-
  cmpd_police %>%
    dplyr::select(name) %>%
    #na.omit() %>%
    st_zm(drop = TRUE) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
    st_transform('EPSG:2264') %>%
    distinct()

bus.sf<-
  cmpd_busstation %>%
    dplyr::select(StopID) %>%
    na.omit() %>%
    st_zm(drop = TRUE) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
    st_transform('EPSG:2264') %>%
    distinct()

church.sf<-
  cmpd_church %>%
    dplyr::select(num_parent) %>%
    #na.omit() %>%
    st_zm(drop = TRUE) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
    st_transform('EPSG:2264') %>%
    distinct()


firestation.sf<-
  cmpd_firestation %>%
    dplyr::select(NAME) %>%
    #na.omit() %>%
    st_zm(drop = TRUE) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
    st_transform('EPSG:2264') %>%
    distinct()

medical.sf<-
  cmpd_medical %>%
    dplyr::select(Name) %>%
    #na.omit() %>%
    st_zm(drop = TRUE) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
    st_transform('EPSG:2264') %>%
    distinct()

park.sf<-
  cmpd_park %>%
    dplyr::select(prkname) %>%
    #na.omit() %>%
    st_zm(drop = TRUE) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
    st_transform('EPSG:2264') %>%
    distinct()

school.sf<-
  cmpd_school %>%
    dplyr::select(school) %>%
    #na.omit() %>%
    st_zm(drop = TRUE) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
    st_transform('EPSG:2264') %>%
    distinct()

nhood.sf<-
  neighborhood %>%
  dplyr::select(vfd) %>%
  #filter(name != NA)%>%
  st_zm(drop = TRUE) %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
  st_transform('EPSG:2264') %>%
  distinct()
mklb.sf <- 
  mklb.sf %>% 
     mutate(
       police_nn1 = nn_function(st_coordinates(mklb.sf), 
                                st_coordinates(police.sf), k = 1),
       police_nn2 = nn_function(st_coordinates(mklb.sf), 
                                st_coordinates(police.sf), k = 2),
       police_nn3 = nn_function(st_coordinates(mklb.sf), 
                                st_coordinates(police.sf), k = 3),
       school_nn1 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(school.sf), k = 1),
       school_nn2 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(school.sf), k = 2),
       school_nn3 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(school.sf), k = 3),
       park_nn1 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(park.sf), k = 1),
       park_nn2 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(park.sf), k = 2),
       park_nn3 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(park.sf), k = 3),
       firestation_nn1 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(firestation.sf), k = 1),
       firestation_nn2 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(firestation.sf), k = 2),
       firestation_nn3 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(firestation.sf), k = 3),
       church_nn1 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(church.sf), k = 1),
       church_nn2 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(church.sf), k = 2),
       church_nn3 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(church.sf), k = 3),
       medical_nn1 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(medical.sf), k = 1),
       medical_nn2 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(medical.sf), k = 2),
       medical_nn3 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(medical.sf), k = 3))

mklb.sf <-st_join(mklb.sf, nhood.sf)
# acs
acs_variable_list.20 <- load_variables(2020, #year
                                         "acs5", #five year ACS estimates
                                         cache = TRUE)
tracts20 <- 
  get_acs(geography = "tract", variables = c("B25026_001E","B02001_002E","B15001_050E",
                                             "B15001_009E","B19013_001E","B25058_001E",
                                             "B06012_002E","B28010_007E","B08101_001E",
                                             "B09001_001E","B09001_003E","B09021_002E",
                                             "B11001I_001E", "B14001_009E",
                                             "B17001_002E","B27001_001E","B18101_001E",
                                             "B19001_001E","B25001_001E","B25040_001E"), 
          year=2020, state=37, county=119, geometry=T, output="wide") %>%
  st_transform('EPSG:2264') %>%
  mutate(TotalPop = B25026_001E, 
         Whites = B02001_002E,
         FemaleBachelors = B15001_050E, 
         MaleBachelors = B15001_009E,
         MedHHInc = B19013_001E, 
         MedRent = B25058_001E,
         TotalPoverty = B06012_002E,
         Nocom = B28010_007E, 
         Waytowork = B08101_001E,
         Popunder18 = B09001_001E, 
         Popunder3 = B09001_003E,
         Singleadult = B09021_002E, 
         Householdtype = B11001I_001E,
         Addmittogra = B14001_009E,
         Poverty  = B17001_002E,
         Healthins  = B27001_001E,
         Disable  = B18101_001E,
         Familyincome  = B19001_001E,
         Housingunits  = B25001_001E,
         Househeatingfuel  = B25040_001E)%>%
  mutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop,0),
         pctBachelors = ifelse(TotalPop > 0, ((FemaleBachelors + MaleBachelors) / TotalPop),0),
         pctPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0),
         year = "2020") 
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |==                                                                    |   4%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |=======                                                               |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |=======                                                               |  11%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |=========                                                             |  12%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |===============                                                       |  22%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |=================                                                     |  25%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |=====================                                                 |  31%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |==========================                                            |  38%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |============================                                          |  41%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |===================================                                   |  51%
  |                                                                            
  |====================================                                  |  51%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=====================================                                 |  52%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |=======================================                               |  55%
  |                                                                            
  |========================================                              |  58%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |==========================================                            |  61%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |=================================================                     |  71%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |=======================================================               |  78%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=============================================================         |  88%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |===============================================================       |  91%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |=================================================================     |  92%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |====================================================================  |  98%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================| 100%

2.2 Summary Statistics

Present a table of summary statistics with variable descriptions. We sort these variables by their category (internal characteristics, amenities/public services or spatial structure).

#internal
factor_internal <- mklb.sf%>%
  dplyr::select(commonpid,price,age,storyheigh,heatedarea,numfirepla,totalac,
                fullbaths,halfbaths,bedrooms,units,prisqft) %>%
  filter(prisqft!= 'Inf') 

factor_internal.sf <- factor_internal

factor_internal <-
  select_if(st_drop_geometry(factor_internal), is.numeric)  %>% na.omit()
#amentities
factor_amenity <- mklb.sf%>%
  dplyr::select(price,prisqft,police_nn1, police_nn2, police_nn3, school_nn1, school_nn2,  school_nn3, church_nn1, church_nn2, church_nn3, park_nn1, park_nn2, park_nn3,firestation_nn1, firestation_nn2, firestation_nn3, medical_nn1, medical_nn2, medical_nn3,police_nn1, police_nn2, police_nn3)

factor_amenity <- factor_amenity%>%filter(prisqft!= 'Inf')
  # dplyr::select(-commonpid,-age,-storyheigh,-heatedarea,-numfirepla,-totalac,
  #                -fullbaths,-halfbaths,-bedrooms,-units)

factor_amenity.sf <- factor_amenity
factor_amenity <-
  select_if(st_drop_geometry(factor_amenity), is.numeric)  %>% na.omit()
#ACS
ACS1 <- mklb.sf%>%
  dplyr::select(price,prisqft)

ACS2 <- tracts20%>%
  dplyr::select(pctWhite, pctBachelors, pctPoverty, TotalPop, MedHHInc, Singleadult, Healthins, Familyincome, Housingunits, Househeatingfuel)

factor_acs <-st_join(ACS1, ACS2) #%>%
  #distinct(geometry, .keep_all = TRUE)

all.sf <-st_join(mklb.sf, ACS2)

factor_acs <-factor_acs%>%filter(prisqft!= 'Inf')
factor_acs.sf <- factor_acs
factor_acs <-
  select_if(st_drop_geometry(factor_acs), is.numeric) %>% na.omit()


star_internal <- st_drop_geometry(factor_internal)
star_amenity <- st_drop_geometry(factor_amenity)
star_acs <- st_drop_geometry(factor_acs)




star_internal <- st_drop_geometry(factor_internal)
star_amenity <- st_drop_geometry(factor_amenity)
star_acs <- st_drop_geometry(factor_acs)

stargazer(star_internal, type = "html", 
          title = "Table DATA 2.1 Summary Statistics of Internal Characteristics ",
          header = FALSE,
          #out = "try.html",
          single.row = TRUE)
## 
## <table style="text-align:center"><caption><strong>Table DATA 2.1 Summary Statistics of Internal Characteristics</strong></caption>
## <tr><td colspan="6" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Statistic</td><td>N</td><td>Mean</td><td>St. Dev.</td><td>Min</td><td>Max</td></tr>
## <tr><td colspan="6" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">price</td><td>44,596</td><td>462,928.900</td><td>377,354.600</td><td>0</td><td>7,700,000</td></tr>
## <tr><td style="text-align:left">age</td><td>44,596</td><td>27.656</td><td>25.838</td><td>0</td><td>2,022</td></tr>
## <tr><td style="text-align:left">storyheigh</td><td>44,596</td><td>1.646</td><td>0.484</td><td>1.000</td><td>3.000</td></tr>
## <tr><td style="text-align:left">heatedarea</td><td>44,596</td><td>2,367.630</td><td>1,070.427</td><td>306.000</td><td>14,710.000</td></tr>
## <tr><td style="text-align:left">numfirepla</td><td>44,596</td><td>0.787</td><td>0.466</td><td>0</td><td>7</td></tr>
## <tr><td style="text-align:left">totalac</td><td>44,596</td><td>1.826</td><td>108.598</td><td>0.000</td><td>9,757.440</td></tr>
## <tr><td style="text-align:left">fullbaths</td><td>44,596</td><td>2.284</td><td>0.829</td><td>0</td><td>9</td></tr>
## <tr><td style="text-align:left">halfbaths</td><td>44,596</td><td>0.602</td><td>0.527</td><td>0</td><td>4</td></tr>
## <tr><td style="text-align:left">bedrooms</td><td>44,596</td><td>3.512</td><td>0.950</td><td>0</td><td>65</td></tr>
## <tr><td style="text-align:left">units</td><td>44,596</td><td>0.976</td><td>0.199</td><td>0</td><td>8</td></tr>
## <tr><td style="text-align:left">prisqft</td><td>44,596</td><td>195.697</td><td>107.364</td><td>0.000</td><td>5,080.808</td></tr>
## <tr><td colspan="6" style="border-bottom: 1px solid black"></td></tr></table>
stargazer(star_amenity, type = "html",
          title = "Table DATA 2.2 Summary Statistics of Amenities Characteristics ",
          header = FALSE,
          #out = "try.html",
          single.row = TRUE)
## 
## <table style="text-align:center"><caption><strong>Table DATA 2.2 Summary Statistics of Amenities Characteristics</strong></caption>
## <tr><td colspan="6" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Statistic</td><td>N</td><td>Mean</td><td>St. Dev.</td><td>Min</td><td>Max</td></tr>
## <tr><td colspan="6" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">price</td><td>46,201</td><td>461,394.900</td><td>373,709.800</td><td>0</td><td>7,700,000</td></tr>
## <tr><td style="text-align:left">prisqft</td><td>46,201</td><td>195.586</td><td>106.372</td><td>0.000</td><td>5,080.808</td></tr>
## <tr><td style="text-align:left">police_nn1</td><td>46,201</td><td>19,759.000</td><td>11,646.630</td><td>232.622</td><td>60,796.570</td></tr>
## <tr><td style="text-align:left">police_nn2</td><td>46,201</td><td>24,973.550</td><td>13,360.550</td><td>1,689.842</td><td>72,490.110</td></tr>
## <tr><td style="text-align:left">police_nn3</td><td>46,201</td><td>28,464.720</td><td>14,531.070</td><td>5,592.504</td><td>79,044.680</td></tr>
## <tr><td style="text-align:left">school_nn1</td><td>46,201</td><td>4,763.712</td><td>3,075.632</td><td>116.992</td><td>22,481.340</td></tr>
## <tr><td style="text-align:left">school_nn2</td><td>46,201</td><td>5,883.655</td><td>3,092.816</td><td>116.992</td><td>22,972.960</td></tr>
## <tr><td style="text-align:left">school_nn3</td><td>46,201</td><td>6,949.664</td><td>3,335.845</td><td>634.527</td><td>24,897.730</td></tr>
## <tr><td style="text-align:left">church_nn1</td><td>46,201</td><td>2,932.691</td><td>2,097.287</td><td>17.465</td><td>14,635.710</td></tr>
## <tr><td style="text-align:left">church_nn2</td><td>46,201</td><td>3,469.833</td><td>2,208.135</td><td>42.527</td><td>16,387.490</td></tr>
## <tr><td style="text-align:left">church_nn3</td><td>46,201</td><td>3,929.411</td><td>2,308.243</td><td>113.048</td><td>17,229.310</td></tr>
## <tr><td style="text-align:left">park_nn1</td><td>46,201</td><td>4,069.461</td><td>2,518.566</td><td>68.884</td><td>16,490.430</td></tr>
## <tr><td style="text-align:left">park_nn2</td><td>46,201</td><td>4,977.050</td><td>2,588.607</td><td>205.557</td><td>16,614.240</td></tr>
## <tr><td style="text-align:left">park_nn3</td><td>46,201</td><td>5,711.460</td><td>2,679.841</td><td>294.621</td><td>18,545.810</td></tr>
## <tr><td style="text-align:left">firestation_nn1</td><td>46,201</td><td>11,730.960</td><td>11,229.980</td><td>143.254</td><td>63,000.030</td></tr>
## <tr><td style="text-align:left">firestation_nn2</td><td>46,201</td><td>14,871.700</td><td>11,175.460</td><td>2,031.637</td><td>64,225.180</td></tr>
## <tr><td style="text-align:left">firestation_nn3</td><td>46,201</td><td>17,164.520</td><td>11,329.790</td><td>3,618.091</td><td>66,259.220</td></tr>
## <tr><td style="text-align:left">medical_nn1</td><td>46,201</td><td>5,652.947</td><td>4,373.851</td><td>37.400</td><td>24,931.340</td></tr>
## <tr><td style="text-align:left">medical_nn2</td><td>46,201</td><td>6,285.044</td><td>4,412.851</td><td>124.004</td><td>25,453.750</td></tr>
## <tr><td style="text-align:left">medical_nn3</td><td>46,201</td><td>6,749.224</td><td>4,480.915</td><td>147.203</td><td>26,588.270</td></tr>
## <tr><td colspan="6" style="border-bottom: 1px solid black"></td></tr></table>
stargazer(star_acs, type = "html",
          title = "Table DATA 2.3 Summary Statistics of ACS ",
          header = FALSE,
          #out = "try.html",
          single.row = TRUE)
## 
## <table style="text-align:center"><caption><strong>Table DATA 2.3 Summary Statistics of ACS</strong></caption>
## <tr><td colspan="6" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Statistic</td><td>N</td><td>Mean</td><td>St. Dev.</td><td>Min</td><td>Max</td></tr>
## <tr><td colspan="6" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">price</td><td>46,198</td><td>461,403.800</td><td>373,718.800</td><td>0</td><td>7,700,000</td></tr>
## <tr><td style="text-align:left">prisqft</td><td>46,198</td><td>195.587</td><td>106.376</td><td>0.000</td><td>5,080.808</td></tr>
## <tr><td style="text-align:left">pctWhite</td><td>46,198</td><td>0.571</td><td>0.270</td><td>0.011</td><td>1.710</td></tr>
## <tr><td style="text-align:left">pctBachelors</td><td>46,198</td><td>0.011</td><td>0.013</td><td>0.000</td><td>0.314</td></tr>
## <tr><td style="text-align:left">pctPoverty</td><td>46,198</td><td>0.091</td><td>0.082</td><td>0.000</td><td>0.614</td></tr>
## <tr><td style="text-align:left">TotalPop</td><td>46,198</td><td>4,030.445</td><td>1,309.832</td><td>790</td><td>7,703</td></tr>
## <tr><td style="text-align:left">MedHHInc</td><td>46,198</td><td>86,205.120</td><td>37,418.970</td><td>17,685</td><td>238,750</td></tr>
## <tr><td style="text-align:left">Singleadult</td><td>46,198</td><td>396.728</td><td>217.725</td><td>29</td><td>1,202</td></tr>
## <tr><td style="text-align:left">Healthins</td><td>46,198</td><td>4,049.650</td><td>1,311.905</td><td>790</td><td>7,703</td></tr>
## <tr><td style="text-align:left">Familyincome</td><td>46,198</td><td>1,503.466</td><td>440.230</td><td>237</td><td>2,659</td></tr>
## <tr><td style="text-align:left">Housingunits</td><td>46,198</td><td>1,598.289</td><td>469.749</td><td>249</td><td>2,778</td></tr>
## <tr><td style="text-align:left">Househeatingfuel</td><td>46,198</td><td>1,503.466</td><td>440.230</td><td>237</td><td>2,659</td></tr>
## <tr><td colspan="6" style="border-bottom: 1px solid black"></td></tr></table>
topredict <-
  all.sf%>%
  filter(toPredict!="CHALLENGE") %>%
  filter(prisqft!= 'Inf')

topredict_num <-
  select_if(st_drop_geometry(topredict), is.numeric) %>% na.omit() 

challenge <-
  all.sf%>%
  filter(toPredict=="CHALLENGE")

challenge_num<-
  select_if(st_drop_geometry(challenge), is.numeric) %>% na.omit()  

Present a table of summary statistics with variable descriptions. Sort these variables by their category (internal characteristics, amenities/public services or spatial structure). Check out the stargazer package for this.

2.3 Correlation Matrix

For better visualization, features were divided into internal characteristics and amenity again when creating correlation matrix. Based on the graduated color, some features have strong relation with Sale Price, such as bathroom, fireplace and house age.

Internal characteristics:

# #internal features
# ggcorrplot(
#   round(cor(factor_internal), 1),
#   p.mat = cor_pmat(factor_internal),
#   colors = c('#05A167', "white", '#6897BB'),
#   type="lower",
#   insig = "blank") +
#     labs(title = "Correlation across numeric variables\n(internal features)\n",
#        caption = 'Figure DATA 2.2') +
#     plotTheme()
# 
# #amenity features
# factor_amenity <-factor_amenity%>%filter(prisqft != 'Inf')
# ggcorrplot(
#   round(cor(factor_amenity), 1),
#   p.mat = cor_pmat(factor_amenity),
#   colors = c('#05A167', "white", '#6897BB'),
#   type="lower",
#   insig = "blank") +
#     labs(title = "Correlation across numeric variables\n(Amenity)\n",
#        caption = 'Figure DATA 2.3') +
#     plotTheme()
# 
# ggcorrplot(
#   round(cor(factor_acs), 1),
#   p.mat = cor_pmat(factor_acs),
#   colors = c('#05A167', "white", '#6897BB'),
#   type="lower",
#   insig = "blank") +
#     labs(title = "Correlation across numeric variables\n(ACS)\n",
#        caption = 'Figure DATA 2.4')+
#     plotTheme()

factor_internal %>%
  correlate() %>%
  autoplot()+
  geom_text(aes(label = round(r,digits=2)),size = 2)

Amenity :There is also a strong connection between house price and public facilities. Generally speaking, if the public facilities are better, the house price will also increase.

factor_acs %>%
  correlate() %>%
  autoplot() +
  geom_text(aes(label = round(r,digits=2)),size = 2)

ACS data :Some of the variables in the ACS data are also closely related to house prices. For example, in this matrix price per square feet seems to be related with percentage of white.

factor_acs %>%
  correlate() %>%
  autoplot() +
  geom_text(aes(label = round(r,digits=2)),size = 2)

2.4 Home prices scatterplots

We present 4 home price correlation scatterplots that we think are of interest, age, bedrooms, living area and single adult. We found a positive correlation between the number of rooms, livingarea and room rates, which is also consistent with common sense. Age and single adults are negatively correlated with house prices, but the coefficients are not large. Age may be due to the fact that buildings built earlier tend to be in low quality.

## Home Features cor
st_drop_geometry(all.sf) %>% 
  mutate(LivingArea = heatedarea) %>%
  dplyr::select(price, LivingArea, age,bedrooms, Singleadult) %>%
  filter(price <= 1000000, age < 500) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 4, scales = "free") +
     labs(title = "Price as a function of continuous variables") +
     plotTheme()

2.5 Map Home Prices

maphomeprice <-topredict%>%filter(prisqft!= 'Inf')
ggplot() +
  geom_sf(data = st_union(tracts20), fill = "grey50") +
  geom_sf(data = maphomeprice, aes(colour = q5(prisqft)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(maphomeprice,"prisqft"),
                   name="Quintile\nBreaks") +
  labs(title="Price Per Square Foot, Mecklenburg County") +
  mapTheme()

Develop 1 map of your dependent variable (sale price)

# # Mapping data
# ggplot() +
#   geom_sf(data = st_union(tracts20), fill = "grey50") +
#    geom_sf(data = topredict%>%filter(prisqft!= 'Inf'), aes(fill = (prisqft)), 
#           show.legend = "point", size = .25) +
#    
#     # scale_colour_manual(values = palette5,
#     #                labels=qBr(topredict,"PricePerSq"),
#     #                name="Quintile\nBreaks") +
#   scale_fill_viridis_c() +
#   labs(title="try") +
#   mapTheme()

2.6 Three interesting maps

By plotting the matrix, we choose some variables with high positive or negative correlation. Percentage of white: The first independent variables we choose is the percentage of white.As the matrix shows, percentage of white and house prices are more positively correlated.We found that both the southern and northern parts of Mecklenburg County have a higher percentage of whites, basically greater than 75%, and this all corresponds to the higher prices in the previous house price map. In contrast, the places with lower house prices have a lower percentage of whites, even below 25%.

# plot map of white

ggplot() +
  geom_sf(data = st_union(tracts20), fill = "grey50") +
  geom_sf(data = factor_acs.sf%>%filter(pctWhite<1), aes(colour = pctWhite),
          show.legend = "point", size = .75) +
  #scale_colour_manual(values = palette5,
                   #labels=qBr(maphomeprice,"prisqft"),
                  #name="Quintile\nBreaks") +
  labs(title="Percentage of white, Mecklenburg County") +
  mapTheme()

Number of fullbaths in houses: The second variable is Housing the number of full baths.We chose this variable because they have a strong correlation in the matrix, which is also in line with the general common sense inference. We can see that there are indeed more toilets in the south where the prices are higher as well.

# plot map of age
factor_internal.sf<-factor_internal.sf%>%filter(age != 2022)
ggplot() +
  geom_sf(data = st_union(tracts20), fill = "grey50") +
  geom_sf(data = factor_internal.sf, aes(colour = fullbaths),
          show.legend = "point", size = .75) +
  #scale_colour_manual(values = palette5,
                   #labels=qBr(maphomeprice,"prisqft"),
                   #name="Quintile\nBreaks") +
  labs(title="Number of fullbaths in houses in Mecklenburg County") +
  mapTheme()

the average distance to nearest park: The third independent variables is the average distance to nearest park. To our surprise, most of the amenities have negative correlation with the house price, so we choose park as an example. Perhaps the fact that the parks are mostly located at the edge of the city where traffic is difficult has affected the surrounding property prices.

# plot map of park

ggplot() +
  geom_sf(data = st_union(tracts20), fill = "grey50") +
  geom_sf(data = factor_amenity.sf, aes(colour = park_nn1),
          show.legend = "point", size = .75) +
  #scale_colour_manual(values = palette5,
                   #labels=qBr(maphomeprice,"prisqft"),
                   #name="Quintile\nBreaks") +
  labs(title="Distribution of average distance to\nnearest park in Mecklenburg County") +
  mapTheme()

## 2.7 Bonus for something extra engaging hospital: We believe that health care resources are also an important factor in housing prices. Although the correlation is not high in the matrix, it is possible that it is influenced by hospital quality. We still find that areas with higher house prices in the south are closer to hospitals.

hospital_nn1_expanded <- topredict %>%
  mutate(hospital_nn1.expanded = medical_nn1 * 1000)

ggplot() + 
  geom_sf(data = st_union(tracts20),fill = '#E5E5E5') +
  geom_sf(data = st_centroid(hospital_nn1_expanded),aes(color = hospital_nn1.expanded),size = .5) +
  # scale_color_manual(values = palette,
  #                    labels = qBr(hospital_nn1_expanded,'hospital_nn1.expanded'),
  #                    name = "Average Distrance\nto Nearest Hospital\n(timed by 1000 \nfor better visualization)\n") +
  labs(title = 'Distribution of Average Distrance to Nearest Hospital\nof Each Home Sale Observation\n') +
  mapTheme() + 
  plotTheme()

III Methods

Method: The mean objective of this analysis is to train a robust model to predict house price as accurately as possible. Accordingly, this section started to provide a detailed interpretation on the process of the model building and training.

Here are three models used in this part.
model1: the primitive model, where all the facotrs are in.
model2: the model after training, with the facotrs selected.
model3: the model to figure out the ‘challenge’

data,model, feature, result(map). why you interprete it

• Data and Sample: In order to accurately build and test models, we randomly split the whole into two parts: with 60% of observations for model training and with 40% of observations for model testing.
• Dependent Variable: The dependent variable is home sale price: Price
• Independent Variables: Based on the hedonic home price model, potential features should encompass at least three components: internal characteristics, like the number of bathrooms; neighborhood amenities/public services, like hospital accessibility; and the underlying spatial process of prices.Plus, we also take the ACS data into consideration.

After creating features and engineering and checking correlation coefficients(in model1), 38 features are eligible for the model and obviously or slightly correlated with that were the initial independent variables for model building. In further modifications of the model, we mainly considered p-values in determining the correlation of variables and removed the values with larger p-values:totalac, park_nn1, park_nn2, firestation_nn1, medical_nn1, TotalPop. We gradually screened out variables and eventually selected only 32 robust features included in the final model model2.

• Procedure: The primitive approach used in this analysis for model building is Ordinary Least Squares Regression (OLS) which is aimed at predicting the of houses as a function of several statistical parameters such as intercept, slope and selected features.

In order to determine which features were able to significantly increase the accuracy and generalizability of the model, we systematically compared, plotted, and mapping the mean absolute errors (MAE), mean absolute percent errors (MAPE), as well as root mean square errors (RMSE, the standard deviation of the residuals) between original and predicted of both training set and test set, conducted the cross-validation on 100 folds, explored the underlying spatial pattern of price and errors caused by problematic prediction. To address the problem of neglecting the underlying spatial correlation caused by neighborhood, we then compared, plotted, and mapped the mean absolute errors (MAE), and mean absolute percent errors (MAPE) on the test set between baseline model without neighborhood features and updated model with neighborhood features, and then determined our final model based on series model examinations.

In this process include:

IV. Result

4.1 Dataset Splitting and Model Building

Firstly, as mentioned above, we randomly split the variables into 60% and 40%. Secondly,we built three models with different features to examine which features should be included.

model1: the primitive model with baseline, where all the facotrs are in. model2: the model after training, with the facotrs selected. model3: it is based on model2, the function of model3 is to figure out the ‘challenge’ to complete the competition.

The baseline model contained 10 internal characteristic features: age, storyheigh, heatedare, numfirepla, totalac, fullbaths, halfbaths, bedrooms, units, prisqft and ,as well as 18 amenity features. Plus, there are also some ACS features.

set.seed(42)

inTrain <- caret::createDataPartition(
  y = paste(topredict$name,topredict$price),
  p = .60, list = FALSE)

# split data into training and test
model.training <- topredict[inTrain,]
model.test <- topredict[-inTrain,] 

## First model: All internal features, amenity features
model1 <- lm(price ~ ., data = as.data.frame(model.training) %>% 
           dplyr::select(colnames(factor_internal), 
                         colnames(factor_amenity),
                         colnames(factor_acs)
))

stargazer(model1, type = "html", 
           title = "Table 4.1.1 Summary Statistics of Model 1 ",
           #out = "model1.html",
           header = FALSE,
           single.row = TRUE)
## 
## <table style="text-align:center"><caption><strong>Table 4.1.1 Summary Statistics of Model 1</strong></caption>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td>price</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">age</td><td>196.414<sup>***</sup> (44.941)</td></tr>
## <tr><td style="text-align:left">storyheigh</td><td>-52,716.280<sup>***</sup> (2,792.770)</td></tr>
## <tr><td style="text-align:left">heatedarea</td><td>246.397<sup>***</sup> (1.665)</td></tr>
## <tr><td style="text-align:left">numfirepla</td><td>7,391.586<sup>***</sup> (2,218.184)</td></tr>
## <tr><td style="text-align:left">totalac</td><td>-5.903 (10.168)</td></tr>
## <tr><td style="text-align:left">fullbaths</td><td>34,075.240<sup>***</sup> (2,036.145)</td></tr>
## <tr><td style="text-align:left">halfbaths</td><td>52,731.190<sup>***</sup> (2,298.652)</td></tr>
## <tr><td style="text-align:left">bedrooms</td><td>-17,253.830<sup>***</sup> (1,162.424)</td></tr>
## <tr><td style="text-align:left">units</td><td>-19,602.230<sup>***</sup> (4,620.319)</td></tr>
## <tr><td style="text-align:left">prisqft</td><td>2,116.208<sup>***</sup> (9.108)</td></tr>
## <tr><td style="text-align:left">police_nn1</td><td>-2.367<sup>***</sup> (0.329)</td></tr>
## <tr><td style="text-align:left">police_nn2</td><td>10.145<sup>***</sup> (0.943)</td></tr>
## <tr><td style="text-align:left">police_nn3</td><td>-10.714<sup>***</sup> (0.755)</td></tr>
## <tr><td style="text-align:left">school_nn1</td><td>-4.643<sup>***</sup> (0.986)</td></tr>
## <tr><td style="text-align:left">school_nn2</td><td>14.139<sup>***</sup> (2.184)</td></tr>
## <tr><td style="text-align:left">school_nn3</td><td>-8.604<sup>***</sup> (1.677)</td></tr>
## <tr><td style="text-align:left">church_nn1</td><td>-10.381<sup>***</sup> (1.824)</td></tr>
## <tr><td style="text-align:left">church_nn2</td><td>18.965<sup>***</sup> (4.045)</td></tr>
## <tr><td style="text-align:left">church_nn3</td><td>-17.203<sup>***</sup> (2.954)</td></tr>
## <tr><td style="text-align:left">park_nn1</td><td>-2.902<sup>**</sup> (1.222)</td></tr>
## <tr><td style="text-align:left">park_nn2</td><td>2.446 (2.795)</td></tr>
## <tr><td style="text-align:left">park_nn3</td><td>5.383<sup>***</sup> (2.052)</td></tr>
## <tr><td style="text-align:left">firestation_nn1</td><td>0.830 (0.520)</td></tr>
## <tr><td style="text-align:left">firestation_nn2</td><td>6.063<sup>***</sup> (1.343)</td></tr>
## <tr><td style="text-align:left">firestation_nn3</td><td>-3.137<sup>***</sup> (1.117)</td></tr>
## <tr><td style="text-align:left">medical_nn1</td><td>2.335<sup>*</sup> (1.399)</td></tr>
## <tr><td style="text-align:left">medical_nn2</td><td>-12.208<sup>***</sup> (3.730)</td></tr>
## <tr><td style="text-align:left">medical_nn3</td><td>8.880<sup>***</sup> (2.691)</td></tr>
## <tr><td style="text-align:left">pctWhite</td><td>-30,834.580<sup>***</sup> (6,100.534)</td></tr>
## <tr><td style="text-align:left">pctBachelors</td><td>-735,300.000<sup>***</sup> (78,277.770)</td></tr>
## <tr><td style="text-align:left">pctPoverty</td><td>134,782.000<sup>***</sup> (16,467.320)</td></tr>
## <tr><td style="text-align:left">TotalPop</td><td>-2.924 (6.927)</td></tr>
## <tr><td style="text-align:left">MedHHInc</td><td>1.112<sup>***</sup> (0.044)</td></tr>
## <tr><td style="text-align:left">Singleadult</td><td>165.311<sup>***</sup> (10.405)</td></tr>
## <tr><td style="text-align:left">Healthins</td><td>21.635<sup>***</sup> (6.186)</td></tr>
## <tr><td style="text-align:left">Familyincome</td><td>-176.713<sup>***</sup> (19.010)</td></tr>
## <tr><td style="text-align:left">Housingunits</td><td>77.352<sup>***</sup> (15.278)</td></tr>
## <tr><td style="text-align:left">Househeatingfuel</td><td></td></tr>
## <tr><td style="text-align:left">Constant</td><td>-511,230.600<sup>***</sup> (10,147.000)</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>27,781</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.870</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.870</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>152,337.000 (df = 27743)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>5,006.904<sup>***</sup> (df = 37; 27743)</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>

4.2 Table of results (training set)

In this part, we study the Summary Statistics of Model 1 to select variables to build our final model. We mainly considered p-values in determining the correlation of variables and removed the values with larger p-values. The R2 is 0.615.

model2 <- lm(price ~ ., data = as.data.frame(model.training) %>% 
           dplyr::select(# internal features
                         price, age, storyheigh,heatedarea, #prisqft,
                         numfirepla, fullbaths,halfbaths,bedrooms,
                         units,
                         # amenity features
                         police_nn1,police_nn2,police_nn3,school_nn1,
                         school_nn2,school_nn3,church_nn1,church_nn2,
                         church_nn3,park_nn3,firestation_nn2,
                         firestation_nn3,medical_nn1,
                         #acs
                         pctWhite,pctBachelors,pctPoverty,MedHHInc,
                         Singleadult,Healthins,Familyincome
                         )
          
)
model.test <-
  model.test %>%
  mutate(SalePrice.Predict = predict(model2, model.test),
         SalePrice.Error = SalePrice.Predict - price,
         SalePrice.AbsError = abs(SalePrice.Predict - price),
         SalePrice.APE = (abs(SalePrice.Predict - price)) / SalePrice.Predict)%>%
  filter(price < 5000000)
stargazer(model2, type = "html", 
          title = "Table 4.1.2 Summary Statistics of Model 2 ",
          header = FALSE,
          #out = "model2.html",
          single.row = TRUE)
## 
## <table style="text-align:center"><caption><strong>Table 4.1.2 Summary Statistics of Model 2</strong></caption>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td>price</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">age</td><td>1,008.386<sup>***</sup> (76.974)</td></tr>
## <tr><td style="text-align:left">storyheigh</td><td>-106,580.100<sup>***</sup> (4,782.411)</td></tr>
## <tr><td style="text-align:left">heatedarea</td><td>232.907<sup>***</sup> (2.855)</td></tr>
## <tr><td style="text-align:left">numfirepla</td><td>12,216.050<sup>***</sup> (3,803.831)</td></tr>
## <tr><td style="text-align:left">fullbaths</td><td>77,329.650<sup>***</sup> (3,483.236)</td></tr>
## <tr><td style="text-align:left">halfbaths</td><td>71,439.500<sup>***</sup> (3,945.878)</td></tr>
## <tr><td style="text-align:left">bedrooms</td><td>-35,406.460<sup>***</sup> (1,994.080)</td></tr>
## <tr><td style="text-align:left">units</td><td>-77,989.680<sup>***</sup> (7,926.400)</td></tr>
## <tr><td style="text-align:left">police_nn1</td><td>-4.881<sup>***</sup> (0.548)</td></tr>
## <tr><td style="text-align:left">police_nn2</td><td>26.226<sup>***</sup> (1.584)</td></tr>
## <tr><td style="text-align:left">police_nn3</td><td>-27.541<sup>***</sup> (1.271)</td></tr>
## <tr><td style="text-align:left">school_nn1</td><td>-0.902 (1.685)</td></tr>
## <tr><td style="text-align:left">school_nn2</td><td>19.853<sup>***</sup> (3.744)</td></tr>
## <tr><td style="text-align:left">school_nn3</td><td>-14.971<sup>***</sup> (2.865)</td></tr>
## <tr><td style="text-align:left">church_nn1</td><td>-18.165<sup>***</sup> (3.132)</td></tr>
## <tr><td style="text-align:left">church_nn2</td><td>42.758<sup>***</sup> (6.943)</td></tr>
## <tr><td style="text-align:left">church_nn3</td><td>-42.666<sup>***</sup> (5.050)</td></tr>
## <tr><td style="text-align:left">park_nn3</td><td>3.775<sup>***</sup> (0.750)</td></tr>
## <tr><td style="text-align:left">firestation_nn2</td><td>15.774<sup>***</sup> (1.624)</td></tr>
## <tr><td style="text-align:left">firestation_nn3</td><td>-9.389<sup>***</sup> (1.780)</td></tr>
## <tr><td style="text-align:left">medical_nn1</td><td>-3.200<sup>***</sup> (0.494)</td></tr>
## <tr><td style="text-align:left">pctWhite</td><td>123,110.400<sup>***</sup> (9,612.447)</td></tr>
## <tr><td style="text-align:left">pctBachelors</td><td>-213,412.600 (132,200.600)</td></tr>
## <tr><td style="text-align:left">pctPoverty</td><td>463,442.000<sup>***</sup> (27,142.020)</td></tr>
## <tr><td style="text-align:left">MedHHInc</td><td>2.936<sup>***</sup> (0.074)</td></tr>
## <tr><td style="text-align:left">Singleadult</td><td>299.515<sup>***</sup> (16.191)</td></tr>
## <tr><td style="text-align:left">Healthins</td><td>-3.106 (4.382)</td></tr>
## <tr><td style="text-align:left">Familyincome</td><td>-70.156<sup>***</sup> (15.601)</td></tr>
## <tr><td style="text-align:left">Constant</td><td>-112,409.900<sup>***</sup> (16,245.090)</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>27,781</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.615</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.614</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>262,002.900 (df = 27752)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>1,580.537<sup>***</sup> (df = 28; 27752)</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>

4.3 Table of goodness of fit (test set)

The Mean Absolute Errors (MAE) and Mean Absolute Percent Errors (MAPE) are statistical measures of how accurate a forecast system is.Based on the following table, the MAE is 125091.8, the errors are accounted for 0.4281398.

test_fit<-as.data.frame(cbind(mean(model.test$SalePrice.AbsError,na.rm = T),
                              mean(model.test$SalePrice.APE,na.rm = T)))
colnames(test_fit)<-c("Mean Absolute Errors (MAE)","MAPE")
kable(test_fit,
      caption = 'Table RESULT. MAE and MAPE for Test Set') %>%
  kable_styling("striped", full_width = F)
Table RESULT. MAE and MAPE for Test Set
Mean Absolute Errors (MAE) MAPE
125091.8 0.4281398

4.4 Cross-Validation Results

Note that R2 judges model accuracy on the data used to train the model. Cross-validation ensures that the goodness of fit results for a single hold out is not a fluke. While there are many forms of cross-validation.

This section conducted cross-validation to assess how our model would generalize to other independent data sets, in which we could flag problem of overfitting or feature selection bias.

Both a summary table and distribution plots of R-square, RMSE, and MAE in each of 100 fold are shown below.The cross-validation output provides very important goodness of fit information. The value of each metric is actually the mean value across all folds. The train function returns many objects (names(reg.cv)), one of which is resample which provides goodness of fit for each of the 100 folds.

The variation of 100 folds can also be visualized with a histogram of across-fold MAE. Although most of the errors are clustering tightly together, some errors are scattered, indicating the inconsistency of the model prediction.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(42)

reg.cv <- 
  train(price ~ ., data = st_drop_geometry(topredict) %>% 
          dplyr::select(price, age, storyheigh,heatedarea, prisqft,
                         numfirepla, fullbaths,halfbaths,bedrooms,
                         units,
                         # amenity features
                         police_nn1,police_nn2,police_nn3,school_nn1,
                         school_nn2,school_nn3,church_nn1,church_nn2,
                         church_nn3,park_nn1,park_nn3,firestation_nn2,
                         firestation_nn3,medical_nn2,medical_nn3,
                         #acs
                         pctWhite,pctBachelors,pctPoverty,MedHHInc,
                         Singleadult,Healthins,Familyincome),
        method = "lm", trControl = fitControl, na.action = na.pass)
kable(reg.cv$resample,
          caption = 'Table RESULT. Cross-validation Test: Summary of RMSE, R Squared and MAE') %>%
  kable_styling("striped", full_width = F)
Table RESULT. Cross-validation Test: Summary of RMSE, R Squared and MAE
RMSE Rsquared MAE Resample
122504.66 0.8743783 64995.85 Fold001
124896.26 0.8588380 68202.16 Fold002
161994.86 0.8501994 73482.32 Fold003
164450.47 0.8695818 68036.35 Fold004
232355.88 0.7604469 87745.75 Fold005
107231.59 0.8852886 58874.35 Fold006
113517.04 0.9099313 69629.86 Fold007
169506.55 0.8612415 66351.18 Fold008
124349.27 0.9020569 65360.79 Fold009
137400.39 0.8561117 71727.48 Fold010
93829.79 0.8977372 65428.13 Fold011
179934.58 0.8984982 74901.61 Fold012
154884.94 0.8054865 67446.97 Fold013
132590.30 0.8768806 66809.67 Fold014
251280.14 0.8101429 80274.94 Fold015
109039.12 0.8924788 69736.76 Fold016
92859.32 0.9117293 60955.78 Fold017
124155.40 0.9281102 71874.20 Fold018
100533.15 0.9415016 65049.85 Fold019
179268.54 0.8178322 74135.68 Fold020
116399.02 0.9112885 64687.85 Fold021
128788.02 0.8563619 69126.81 Fold022
104117.79 0.8871478 64615.74 Fold023
146810.39 0.8604731 70608.83 Fold024
158307.23 0.9063928 80484.43 Fold025
129193.00 0.8839831 73458.36 Fold026
125955.56 0.8920068 69655.12 Fold027
118312.84 0.9095948 69633.09 Fold028
198265.11 0.8501643 76105.95 Fold029
108238.14 0.8852280 65449.00 Fold030
109046.91 0.9447907 63812.01 Fold031
126490.16 0.8707950 67027.19 Fold032
87321.48 0.9163583 58679.07 Fold033
106528.26 0.9058978 65071.24 Fold034
157714.06 0.8805110 81052.13 Fold035
235190.04 0.8580654 76437.91 Fold036
127674.15 0.9228530 67403.00 Fold037
233191.85 0.8386995 74514.32 Fold038
97814.49 0.8952665 63047.65 Fold039
102358.01 0.9093912 66173.10 Fold040
112505.58 0.9170178 67778.25 Fold041
162883.67 0.8892207 73312.85 Fold042
98120.88 0.9074506 66002.77 Fold043
110778.78 0.8861272 69547.06 Fold044
98891.90 0.8861429 63566.78 Fold045
144205.06 0.8711989 68590.85 Fold046
157739.82 0.8456615 69601.91 Fold047
125412.27 0.8972081 70654.62 Fold048
110048.23 0.8957732 64754.34 Fold049
190007.52 0.8005864 72659.83 Fold050
104048.33 0.9186885 60732.56 Fold051
125236.13 0.8781373 73615.91 Fold052
101578.30 0.9005791 67809.58 Fold053
123969.59 0.8846029 71353.91 Fold054
117987.88 0.8924390 69108.59 Fold055
159625.63 0.8318301 71377.70 Fold056
123243.97 0.8862276 71867.86 Fold057
108392.98 0.8762266 67567.22 Fold058
117386.06 0.9184440 68697.02 Fold059
109793.58 0.8552744 66327.24 Fold060
144880.62 0.8488145 73357.18 Fold061
136770.47 0.8723256 74378.98 Fold062
88753.19 0.9049809 59530.08 Fold063
95735.44 0.9083808 63364.06 Fold064
120142.18 0.8995945 76963.70 Fold065
105672.02 0.8756720 64940.74 Fold066
111164.83 0.9059645 68931.19 Fold067
153613.46 0.8584548 69407.24 Fold068
103301.61 0.8895758 64793.44 Fold069
175748.89 0.8517784 74892.60 Fold070
101318.78 0.8848949 65510.83 Fold071
207164.14 0.8239167 77540.45 Fold072
137219.06 0.8898394 68582.86 Fold073
163950.27 0.8747377 72493.63 Fold074
176636.86 0.8256292 68399.89 Fold075
114895.26 0.8866002 67287.00 Fold076
169918.35 0.8645435 76566.94 Fold077
96132.56 0.8980111 63998.02 Fold078
113673.02 0.8779258 70226.71 Fold079
105753.26 0.8886117 64894.23 Fold080
111718.72 0.8957923 65028.89 Fold081
192409.31 0.8389409 74206.22 Fold082
167120.32 0.8679844 74065.45 Fold083
133460.93 0.8811429 72745.85 Fold084
145402.01 0.8708571 68966.66 Fold085
107468.71 0.9044587 62019.22 Fold086
144057.51 0.8719382 67479.74 Fold087
145949.43 0.8836133 64587.22 Fold088
127397.63 0.8749113 64681.93 Fold089
122381.26 0.8843794 69220.71 Fold090
170742.17 0.8806996 77403.90 Fold091
171960.36 0.8423798 80525.62 Fold092
132629.81 0.8851896 63103.04 Fold093
113658.81 0.8739056 60451.39 Fold094
126422.42 0.9071228 67514.70 Fold095
119956.37 0.9051804 68877.02 Fold096
115539.32 0.8793529 65974.18 Fold097
121882.34 0.8707135 63871.41 Fold098
106684.54 0.9184723 65713.60 Fold099
226283.68 0.7269545 74794.28 Fold100
ggplot(data = reg.cv$resample) +
  geom_histogram(aes(x = reg.cv$resample$MAE), fill = 'blue') +
  labs(title="Distribution of Cross-validation MAE",
       subtitle = "K = 100\n",
       caption = "Figure RESULT") +
  xlab('Mean Absolute Error') +
  ylab('Count') +
  plotTheme()

reg.cv$resample %>% 
  pivot_longer(-Resample) %>% 
  mutate(name = as.factor(name)) %>% 
  ggplot(., aes(x = name, y = value, color = name)) +
  geom_jitter(width = 0.1) +
  facet_wrap(~name, ncol = 3, scales = "free") +
  theme_bw() +
  theme(legend.position = "none") +
  labs(title = 'Cross-validation Test: Distribution of MAE, RMSE, R Squared\n',
       caption = "Figure RESULT") +
  plotTheme()

4.5 Pred vs Obs Scatterplot

/#缺一张图!

4.6 Map of residuals for our test:setResiduals Map w/ Moran’s I/plot of the spatial lag in errors

lIn this section, we focus on verifying whether the spatial distribution will have an impact on our model. We mainly use the following approaches:

residual maps: It shows the residuals in different places in Mecklenberg County, and we can also observed the distribution of residuals. /#缺失# MoranTest:An approach for measuring spatial autocorrelation. In the picture, the frequency of all 999 randomly permutated I are plotted as a histogram with the Observed I indicated by the orange line. That the observed I is higher then all of the 999 randomly generated, so it provides visual confirmation of spatial autocorrelation.

spatial lag in errors: A spatial lag is a variable that averages the neighboring values of a location. #可能的结论,但是缺失了The spatial lag of error plot shows that as home price errors increase, the nearby home price errors slightly decrease, indicating that our model errors are rarely spatially autocorrelated.

coords <- st_coordinates(all.sf) 

neighborList <- knn2nb(knearneigh(coords, 5))

spatialWeights <- nb2listw(neighborList, style="W")

all.sf$lagPrice <- lag.listw(spatialWeights, all.sf$price)

coords.test <-  st_coordinates(model.test) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

model.test %>%
  mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error,NAOK =TRUE))
## Simple feature collection with 17380 features and 47 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1385257 ymin: 465535.9 xmax: 1534240 ymax: 645589.8
## Projected CRS: NAD83 / North Carolina (ftUS)
## First 10 features:
##    commonpid   price age storyheigh heatedarea numfirepla totalac fullbaths
## 4   00101323 1125000  31        1.0       4272          1   0.000         4
## 8   00101334  925000  25        1.5       2736          1   0.500         2
## 10  00101341  250000  42        1.0       1242          0   0.000         1
## 16  00101461  422000  30        1.5       2522          1   0.000         2
## 18  00101488 1135000   5        2.5       5031          1   0.557         5
## 20  00102211  675000  47        2.0       2878          1   0.981         3
## 21  00102218  272000 112        1.0       1312          1   2.440         1
## 24  00102303  835000  50        1.0       1923          0   0.000         3
## 27  00102320  738000  57        2.0       3400          1   0.595         3
## 35  00103210 1400000  47        1.0       3509          1   0.390         3
##    halfbaths bedrooms units toPredict  prisqft police_nn1 police_nn2 police_nn3
## 4          0        4     1 MODELLING 263.3427   42814.36   56382.09   61783.44
## 8          1        3     1 MODELLING 338.0848   43838.33   57270.03   62745.67
## 10         1        3     1 MODELLING 201.2882   43401.44   56697.05   62264.28
## 16         1        4     1 MODELLING 167.3275   39870.78   53636.06   58950.12
## 18         1        6     1 MODELLING 225.6013   41747.08   55156.49   60674.73
## 20         0        5     1 MODELLING 234.5379   38333.27   52544.99   57566.35
## 21         0        2     1 MODELLING 207.3171   38064.68   52195.36   57282.52
## 24         0        2     1 MODELLING 434.2174   38205.51   52461.10   57451.44
## 27         0        4     1 MODELLING 217.0588   37622.30   51970.70   56900.56
## 35         0        4     1 MODELLING 398.9741   41321.34   55206.74   60587.25
##    school_nn1 school_nn2 school_nn3 park_nn1 park_nn2  park_nn3 firestation_nn1
## 4    12967.26   13487.88   15250.39 8995.086 9005.873  9384.504        44309.99
## 8    13634.94   14152.18   16085.28 8982.734 9559.382 10150.882        44474.41
## 10   12901.07   13414.79   15678.48 8020.579 9286.375 10029.376        43526.92
## 16   10767.32   11285.37   12947.53 7075.119 7306.465  7766.007        43021.73
## 18   11522.54   12041.60   14288.38 7546.496 8461.721  8992.145        42784.18
## 20   11183.35   11671.73   12269.87 4423.064 4586.123  5467.027        44035.66
## 21   10610.37   11101.54   11974.64 4937.122 4960.590  5728.464        43445.71
## 24   11278.08   11762.25   12239.55 4167.614 4324.989  5217.078        44159.43
## 27   11258.22   11729.04   11973.52 3704.216 3721.886  4570.242        44072.49
## 35   11776.25   12793.94   13691.11 3435.516 4900.983  5423.533        47509.64
##    firestation_nn2 firestation_nn3 church_nn1 church_nn2 church_nn3 medical_nn1
## 4         45395.03        46726.07   1930.098   4460.378   5766.272   13979.465
## 8         45758.44        47314.53   2349.786   5225.169   6501.894   14919.261
## 10        44875.53        46586.55   1701.240   4787.231   5974.452   15563.501
## 16        43667.32        44587.77   2822.831   3407.540   4379.487   12731.629
## 18        43888.94        45375.06    473.562   3331.972   4532.723   14730.181
## 20        44172.52        44385.66   3043.236   4166.987   4659.616   10342.670
## 21        43619.21        43934.06   2527.406   3756.665   4296.421   10898.500
## 24        44247.76        44390.04   3093.859   4145.308   4737.170   10110.304
## 27        44088.93        44125.94   3041.189   3881.930   4800.328    9728.320
## 35        48418.87        48924.48   3048.513   3317.797   5266.447    4424.001
##    medical_nn2 medical_nn3          vfd  pctWhite pctBachelors pctPoverty
## 4    14862.766   15231.292 Huntersville 0.9892357            0 0.05992106
## 8    15863.350   16249.079 Huntersville 0.9892357            0 0.05992106
## 10   16438.250   16805.117 Huntersville 0.9892357            0 0.05992106
## 16   13333.936   13622.015 Huntersville 0.9892357            0 0.05992106
## 18   15475.060   15805.081 Huntersville 0.9892357            0 0.05992106
## 20   10786.610   11026.666 Huntersville 0.9892357            0 0.05992106
## 21   11297.811   11525.131 Huntersville 0.9892357            0 0.05992106
## 24   10536.698   10771.367 Huntersville 0.9892357            0 0.05992106
## 27   10061.893   10268.656 Huntersville 0.9892357            0 0.05992106
## 35    5651.929    6101.601    Cornelius 0.9892357            0 0.05992106
##    TotalPop MedHHInc Singleadult Healthins Familyincome Housingunits
## 4      2787    92038         565      2787         1408         1709
## 8      2787    92038         565      2787         1408         1709
## 10     2787    92038         565      2787         1408         1709
## 16     2787    92038         565      2787         1408         1709
## 18     2787    92038         565      2787         1408         1709
## 20     2787    92038         565      2787         1408         1709
## 21     2787    92038         565      2787         1408         1709
## 24     2787    92038         565      2787         1408         1709
## 27     2787    92038         565      2787         1408         1709
## 35     2787    92038         565      2787         1408         1709
##    Househeatingfuel                 geometry SalePrice.Predict SalePrice.Error
## 4              1408   POINT (1423727 618613)         1164297.7        39297.68
## 8              1408   POINT (1422602 619021)          685058.3      -239941.71
## 10             1408   POINT (1422217 618131)          316404.3        66404.27
## 16             1408 POINT (1426036 616681.8)          620949.9       198949.93
## 18             1408 POINT (1423547 617088.9)         1239420.2       104420.20
## 20             1408 POINT (1428858 616825.3)          697990.8        22990.77
## 21             1408 POINT (1428571 616301.7)          452560.2       180560.22
## 24             1408 POINT (1429134 616856.6)          678716.5      -156283.52
## 27             1408 POINT (1429850 616630.1)          858856.2       120856.22
## 35             1408 POINT (1432562 622381.3)          940280.8      -459719.18
##    SalePrice.AbsError SalePrice.APE lagPriceError
## 4            39297.68    0.03375226     51046.144
## 8           239941.71    0.35025006    106894.023
## 10           66404.27    0.20987159     45624.828
## 16          198949.93    0.32039609     90785.536
## 18          104420.20    0.08424923    104779.159
## 20           22990.77    0.03293851     23626.126
## 21          180560.22    0.39897502    -27631.198
## 24          156283.52    0.23026333     59480.984
## 27          120856.22    0.14071764      4053.038
## 35          459719.18    0.48891690   -495094.125
#morantest

moranTest <- moran.mc(model.test$SalePrice.Error, zero.policy=TRUE, #zero.policy make code runs, but the neighbor is empty
                      spatialWeights.test, nsim = 999,na.action=na.omit)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

4.7 Plot predicted prices as a function of observed prices

Regression SalePrice.AbsError SalePrice.APE
Baseline Regression 125091.8 0.4281398
Neighborhood Effects 125118.4 0.3317715
bothRegressions %>%
  dplyr::select(SalePrice.Predict, price, Regression) %>%
    ggplot(aes(price, SalePrice.Predict)) +
  geom_point() +
  stat_smooth(aes(price, price), 
             method = "lm", se = FALSE, size = 1, colour="#FA7800") + 
  stat_smooth(aes(SalePrice.Predict, price), 
              method = "lm", se = FALSE, size = 1, colour="#25CB10") +
  facet_wrap(~Regression) +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
  plotTheme()

4.8 Map of MAPE by neighborhood

/#表错了,应该加上MAPE Indicated in the summary table as well as the choropleth map below, MAPE varies across neighborhoods (only 22 neighborhoods left in the test set). 缺结论

vfd meanPrice meanPrediction
Carolina 351840.3 351840.3
Charlotte 424147.2 424147.2
Charlotte Rural 332920.2 332920.2
Cook’s 385791.7 385791.7
Cornelius 624048.7 624048.7
Cornelius Rural 527500.0 527500.0
Davidson 587246.3 587246.3
Davidson Rural 765058.8 765058.8
Hemby Bridge 448000.0 448000.0
Huntersville 442093.2 442093.2
Huntersville Rural 665235.3 665235.3
Idlewild 276348.5 276348.5
Long Creek 298238.6 298238.6
Matthews 421957.7 421957.7
Midland 340500.0 340500.0
Mint Hill 399338.2 399338.2
Mint Hill Rural 381580.5 381580.5
Pineville 369035.3 369035.3
Robinson 319331.6 319331.6
Steele Creek #1 397110.0 397110.0
Steele Creek #2 500462.2 500462.2
West Mecklenburg 350218.5 350218.5
st_drop_geometry(bothRegressions) %>%
  group_by(Regression, vfd) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(nhood.sf) %>%
    st_sf() %>%
    ggplot() + 
      geom_sf(aes(fill = mean.MAPE)) +
      geom_sf(data = bothRegressions, colour = "black", size = .05) +
      facet_wrap(~Regression) +
      scale_fill_gradient(low = palette5[1], high = palette5[5],
                          name = "MAPE") +
      labs(title = "Mean test set MAPE by neighborhood") +
      mapTheme()

4.9 Predicted Map (Test Set)

ggplot() +
  geom_sf(data = st_union(tracts20), fill = "grey50") +
  geom_sf(data = model.test, aes(colour = q5(prisqft)), 
          show.legend = "point", size = .6) +
  scale_colour_manual(values = palette5,
                   labels=qBr(model.test,"prisqft"),
                   name="Quintile\nBreaks") +
  labs(title="Price predicted per sqft, Mecklenburg County") +
  mapTheme()

做过了## MAPE map by neighborhood

4.10 Scatterplot - MAPE by neighborhood mean price

nhood_sum <- model.test %>% 
  group_by(vfd) %>%
  summarize(meanPrice = mean(price, na.rm = T),
            meanPrediction = mean(SalePrice.Predict, na.rm = T),
            meanMAE = mean(SalePrice.AbsError, na.rm = T),
            meanMAPE = mean(SalePrice.APE, na.rm = T)) 
MAPE.nhood.plot <-ggplot()+
  geom_point(data = nhood_sum, aes(x = meanPrice, y = meanMAPE)) +
  stat_smooth(data = nhood_sum, aes(x = meanPrice, y = meanMAPE), method = "loess", se = FALSE, size = 1, colour="red") + 
  labs(title = "MAPE by Neighborhood as a Function of\nMean Price by Neighborhood\n",
       caption = 'Figure RESULT 9') +
  xlab('Mean Price by Neighborhood') +
  ylab('Mean MAPE by Neighborhood') +
  plotTheme()



MAPE.nhood.plot

4.11 Split by Census Groups

In this part, we chose two variables:race, income and plot them by census tract.The first map shows that Mecklenburg County is racially divided, with predominantly white neighborhoods to the north and south, which are also where housing prices are higher. The lower income groups are scattered in the more central parts of Mecklenburg County.

tracts20 <- tracts20 %>%
  mutate(percentWhite = Whites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(MedHHInc > 32322, "High Income", "Low Income"))

grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(tracts20), aes(fill = raceContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(tracts20), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"))

st_join(bothRegressions, tracts20) %>% 
  group_by(Regression, raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood racial context")
Test set MAPE by neighborhood racial context
Regression Majority Non-White Majority White
Baseline Regression 55% 33%
Neighborhood Effects 37% 30%
st_join(bothRegressions, tracts20) %>% 
  filter(!is.na(incomeContext)) %>%
  group_by(Regression, incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood income context")
Test set MAPE by neighborhood income context
Regression High Income Low Income
Baseline Regression 43% 43%
Neighborhood Effects 33% 51%

V Discussion

Is this an effective model? What were some of the more interesting variables? How much of the variation in prices could you predict? Describe the more important features? Describe the error in your predictions? According to your maps, could you account the spatial variation in prices? Where did the model predict particularly well? Poorly? Why do you think this might be?

5.1 Accuracy

5.2 Generalizability

VI Conclusion

Would you recommend your model to Zillow? Why or why not? How might you improve this model?

VII Challenge